Abstract: Machine learning over fully distributed data poses
an important problem in peer-to-peer (P2P) applications. In this 
model we have one data record at each network node, but without 
the possibility to move raw data due to privacy considerations. 
For example, user profiles, ratings, history, or sensor readings 
can represent this case. This problem is difficult, because there 
is no possibility to learn local models, the system model offers 
almost no guarantees for reliability, yet the communication 
cost needs to be kept low. Here we propose gossip learning, 
a generic approach that is based on multiple models taking 
random walks over the network in parallel, while applying an 
online learning algorithm to improve themselves, and getting 
combined via ensemble learning methods. We present an instan- 
tiation of this approach for the case of classification with linear 
models. Our main contribution is an ensemble learning method 
which — through the continuous combination of the models in 
the network — implements a virtual weighted voting mechanism 
over an exponential number of models at practically no extra 
cost as compared to independent random walks. We prove the 
convergence of the method theoretically, and perform extensive 
experiments on benchmark datasets. Our experimental analysis 
demonstrates the performance and robustness of the proposed 
approach. 

Index Terms — P2P; gossip; bagging; online learning; stochastic 
gradient descent; random walk 

I. Introduction 

The main attraction of peer-to-peer (P2P) technology for 
distributed applications and systems is acceptable scalability 
at a low cost (no central servers are needed) and a potential 
for privacy preserving solutions, where data never leaves the 
computer of a user in a raw form. The label P2P covers a wide 
range of distributed algorithms that follow a specific system 
model, in which there are only minimal assumptions about the 
reliability of communication and the network components. A 
typical P2P system consists of a very large number of nodes 
(peers) that communicate via message passing. Messages can 
be delayed or lost, and peers can join and leave the system at 
any time. 

In recent years, there has been an increasing effort to 
develop collaborative machine learning algorithms that can be 
applied in P2P networks. This was motivated by the various 
potential applications such as spam filtering, user profile analysis,
recommender systems and ranking. For example, for a
P2P platform that offers rich functionality to its users including
spam filtering, personalized search, and recommendation [1]–[3],
or for P2P approaches for detecting distributed attack
vectors [4], complex predictive models have to be built based
on fully distributed, and often sensitive, data.

An important special case of P2P data processing is fully 
distributed data, where each node holds only one data record 
containing personal data, preferences, ratings, history, local 
sensor readings, and so on. Often, these personal data records 
are the most sensitive ones, so it is essential that we process 
them locally. At the same time, the learning algorithm has to 
be fully distributed, since the usual approach of building local 
models and combining them is not applicable. 

Our goal here is to present algorithms for the case of 
fully distributed data. The design requirements specific to the 
P2P aspect are the following. First, the algorithm has to be 
extremely robust. Even in extreme failure scenarios it should 
maintain a reasonable performance. Second, prediction should 
be possible at any time in a local manner; that is, all nodes 
should be able to perform high quality prediction immediately 
without any extra communication. Third, the algorithm has 
to have a low communication complexity; both in terms of 
the number of messages sent, and the size of these messages 
as well. Privacy preservation is also one of our main goals, 
although in this study we do not analyze this aspect explicitly.

The gossip learning approach we propose involves models 
that perform a random walk in the P2P network, and that 
are updated each time they visit a node, using the local data 
record. There are as many models in the network as the 
number of nodes. Any online algorithm can be applied as a 
learning algorithm that is capable of updating models using a 
continuous stream of examples. Since models perform random 
walks, all nodes will experience a continuous stream of models 
passing through them. Apart from using these models for 
prediction directly, nodes can also combine them in various 
ways using ensemble learning. 

The generic skeleton of gossip learning involves three main 
components: an implementation of random walk, an online 
learning algorithm, and ensemble learning. In this paper we 
focus on an instantiation of gossip learning, where the online 
learning method is a stochastic gradient descent for linear
models. In addition, nodes do not simply update and then pass
on models during the random walk, but they also combine 
these models in the process. This implements a distributed 
"virtual" ensemble learning method similar to bagging, in 
which we in effect calculate a weighted voting over an 
exponentially increasing number of linear models. 

Our specific contributions include the following: (1) we 
propose gossip learning, a novel and generic approach for P2P 
learning on fully distributed data, which can be instantiated 
in various different ways; (2) we introduce a novel, efficient 
distributed ensemble learning method for linear models that 
virtually combines an exponentially increasing number of 
linear models; and (3) we provide a theoretical and empirical 
analysis of the convergence properties of the method in various 
scenarios. 

The outline of the paper is as follows. Section II elaborates
on the fully distributed data model. Section III summarizes
related work and the background concepts. In Section IV we
describe our generic approach and a naive algorithm as an example.
Section V presents the core algorithmic contributions of
the paper along with a theoretical discussion, while Section VI
contains an experimental analysis. Section VII concludes the
paper.

This paper is a significantly extended and improved version
of our previous work [5].

II. Fully Distributed Data 

Our focus is on fully distributed data, where each node in 
the network has a single feature vector, that cannot be moved 
to a server or to other nodes. Since this model is not usual in 
the data mining community, we elaborate on the motivation 
and the implications of the model. 

In the distributed computing literature the fully distributed 
data model is typical. In the past decade, several algorithms 
have been proposed to calculate distributed aggregation queries 
over fully distributed data, such as the average, the maximum, 
and the network size (e.g., [6]–[8]). Here, the assumption is
that every node stores only a single record, for example, a 
sensor reading. The motivation for not collecting raw data 
but processing it in place is mainly to achieve robustness and 
adaptivity through not relying on any central servers. In some 
systems, like in sensor networks or mobile ad hoc networks, 
the physical constraints on communication also prevent the 
collection of the data. 

An additional motivation for not moving data is privacy 
preservation, where local data is not revealed in its raw form, 
even if the computing infrastructure made it possible. This 
is especially important in smart phone applications [9]–[11]
and in P2P social networking [12], where the key motivation
is giving the user full control over personal data. In these 
applications it is also common for a user to contribute only a 
single record, for example, a personal profile, a search history, 
or a sensor reading by a smart phone. 

Clearly, in P2P smart phone applications and P2P social 
networks, there is a need for more complex aggregation 
queries, and ultimately, for data models, to support features
such as recommendations and spam filtering, and to make
the system more robust with the help of, for example, dis- 
tributed intruder detection. In other fully distributed systems 
data models are also important for monitoring and control. 
Motivated by the emerging need for building complex data 
models over fully distributed data in different systems, we 
work with the abstraction of fully distributed data, and we 
aim at proposing generic algorithms that are applicable in all 
compatible systems. 

In the fully distributed model, the requirements of an algo- 
rithm also differ from those of parallel data mining algorithms, 
and even from previous work on P2P data mining. Here, the 
decisive factor is the cost of message passing. Besides, the 
number of messages each node is allowed to send in a given 
time window is limited, so computation that is performed 
locally has a cost that is typically negligible when compared to 
communication delays. For this reason prediction performance 
has to be investigated as a function of the number of messages 
sent, as opposed to wall clock time. Since communication is 
crucially important, evaluating robustness to communication 
failures, such as message delay and message loss, also gets a 
large emphasis. 

The approach we present here can also be applied
when each node stores many records (and not only one);
but its advantages over known approaches to P2P data mining
become less significant, since communication plays a smaller
role when local data is already usable to build reasonably good
models. In the following we focus on the fully distributed 
model. 

III. Background and Related Work 

We organize the discussion of the background of our work
along the generic model components outlined in the Introduction
and explained in Section IV: online learning, ensemble
learning, and peer sampling. We also discuss related work
in P2P data mining. Here we do not consider parallel data 
mining algorithms. This field has a large literature, but the 
rather different underlying system model means it is of little 
relevance to us here. 

a) Online Learning: The basic problem of supervised
binary classification can be defined as follows. Let us assume
that we are given a labeled database in the form of
pairs of feature vectors and their correct classification, i.e.,
$(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. The
constant $d$ is the dimension of the problem (the number of
features). We are looking for a model $f : \mathbb{R}^d \to \{-1, 1\}$
that correctly classifies the available feature vectors, and that 
can also generalize well; that is, which can classify unseen 
examples too. For testing purposes, the available data is often 
partitioned into a training set and a test set, the latter being 
used only for testing candidate models. 

Supervised learning can be thought of as an optimization 
problem, where we want to maximize prediction performance, 
which can be measured via, for example, the number of feature 
vectors that are classified correctly over the training set. The 
search space of this problem consists of the set of possible
models (the hypothesis space) and each method also defines a
specific search algorithm (often called the training algorithm) 
that eventually selects one model from this space. 

Training algorithms that iterate over available training data, 
or process a continuous stream of data records, and evolve a 
model by updating it for each individual data record according 
to some update rule are called online learning algorithms. 
Gossip learning relies on this type of learning algorithm. Ma
et al. provide a summary of online learning for large scale
data [13].

Stochastic gradient search [14], [15] is a generic algorithmic
family for implementing online learning methods. Without
going into too much detail, the basic idea is that we iterate over
the training examples in a random order repeatedly, and for
each training example, we calculate the gradient of the error
function (which describes classification error), and modify the
model along this gradient to reduce the error on this particular
example. At the same time, the step size along the gradient
is gradually reduced. In many instantiations of the method, it
can be proven that the converged model minimizes the sum of
the errors over the examples [16].
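As a rough illustration of the procedure just described, the following Python sketch runs stochastic gradient descent for a linear model with a squared-error function. The function name `sgd_linear`, the particular step-size schedule, and the toy data are assumptions made for this example only; they are not taken from the cited works.

```python
import random

def sgd_linear(examples, epochs=200, eta0=0.5, seed=0):
    """Stochastic gradient descent for a linear model w (no bias term).

    examples: list of (x, y) pairs, x a list of floats, y a float.
    The error of one example is 0.5 * (y - <w, x>)^2.
    """
    rng = random.Random(seed)
    d = len(examples[0][0])
    w = [0.0] * d
    k = 0
    for _ in range(epochs):
        order = list(examples)
        rng.shuffle(order)                    # visit the examples in random order
        for x, y in order:
            k += 1
            eta = eta0 / (1.0 + 0.01 * k)     # gradually reduced step size (assumed schedule)
            pred = sum(wi * xi for wi, xi in zip(w, x))
            # The gradient of the per-example error is -(y - pred) * x,
            # so we step in the opposite direction.
            w = [wi + eta * (y - pred) * xi for wi, xi in zip(w, x)]
    return w

# Toy data that is exactly consistent with the model w = (2, -1).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = sgd_linear(data)
```

On this toy data the iterate approaches the model (2, -1), which has zero error on every example.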

Let us now turn to support vector machines (SVM), the
learning algorithm we apply in this paper [17]. In its simplest
form, the SVM approach works with the space of linear
models to solve the binary classification problem. Assuming a
$d$ dimensional problem, we want to find a $d - 1$ dimensional
separating hyperplane that maximizes the margin that separates
examples of the two classes. The margin is defined by the
hyperplane as the sum of the minimal perpendicular distances
from both classes.

Equation (1) states a variant of the formal SVM optimization
problem, where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are the parameters
of the model, namely the normal vector of the separating hyperplane
and the bias parameter, respectively. Furthermore, $\xi_i$ is the
slack variable of the $i$th sample, which can be interpreted as
the amount of misclassification error of the $i$th sample, and
$C$ is a trade-off parameter between generalization and error
minimization.

\[
\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \xi_i \ \text{and}\ \xi_i \ge 0 \ (\forall i : 1 \le i \le n) \tag{1}
\]

The Pegasos algorithm is an SVM training algorithm, based 
on a stochastic gradient descent approach [18]. It directly
optimizes a form of the above defined, so-called primal 
optimization task. We will use the Pegasos algorithm as a basis 
for our distributed method. In this primal form, the desired 
model w is explicitly represented, and is evaluated directly 
over the training examples. 

Since in the context of SVM learning this is an unusual 
approach, let us take a closer look at why we decided to work 
in the primal formulation. The standard SVM algorithms solve
the dual problem instead of the primal form [17]. The dual
form is

\[
\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i y_i \alpha_j y_j x_i^T x_j
\quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \ \text{and}\ 0 \le \alpha_i \le C \ (\forall i : 1 \le i \le n), \tag{2}
\]

where the variables $\alpha_i$ are the Lagrangian variables. The
Lagrangian variables can be interpreted as the weights of the
training samples, which specify how important the corresponding
sample is from the point of view of the model.
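For reference, the interpretation of the Lagrangian variables as sample weights can be made concrete with the standard textbook relation between the two formulations (stated here as background, not taken from the text above): at the optimum, the primal weight vector is a weighted sum of the training samples.

```latex
\[
w = \sum_{i=1}^{n} \alpha_i \, y_i \, x_i
\]
```

Samples with $\alpha_i = 0$ therefore do not contribute to the model at all, while samples with a large $\alpha_i$ influence it strongly.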

The primal and dual formalizations are equivalent, both 
in terms of theoretical time complexity and the optimal 
solution. Solving the dual problem has some advantages; 
most importantly, one can take full advantage of the kernel- 
based extensions (which we have not discussed here) that 
introduce nonlinearity into the approach. However, methods 
that deal with the dual form require frequent access to the
entire database to update $\alpha_i$, which is infeasible in our system
model. Besides, the number of variables $\alpha_i$ equals the number
of training samples, which could be orders of magnitude larger
than the dimension of the primal problem, $d$. Finally, there are
indications that applying the primal form can achieve a better
generalization on some databases [19].

b) Ensemble Learning: Most distributed large scale algorithms
apply some form of ensemble learning to combine
models learned over different samples of the training data.
Rokach presents a survey of ensemble learning methods [20].
We apply a method for combining the models in the network
that is related to both bagging [21] and "pasting small
votes" [22]: when the models start their random walk, initially
they are based on non-overlapping small subsets of the training 
data due to the large scale of the system (the key idea behind 
pasting small votes) and as time goes by, the sample sets grow, 
approaching the case of bagging (although the samples that 
belong to different models will not be completely independent 
in our case). 

c) Peer Sampling in Distributed Systems: The sampling
probability for each data record is defined by peer sampling
algorithms that are used to implement the random walk.
Here we apply uniform sampling. A set of approaches to
implement uniform sampling in a P2P network apply random
walks themselves over a fixed overlay network, in such a way
that the corresponding Markov-chain has a uniform limiting
distribution [23]–[25]. In our algorithm, we apply gossip-based
peer sampling [26] where peers periodically exchange small
random subsets of addresses, thereby providing a local random 
sample of the addresses at each point in time at each node. 
The advantage of gossip-based sampling in our setting is that 
samples are available locally and without delay. Furthermore, 
the messages related to the peer sampling algorithm can 
piggyback the random walks of the models, thereby avoiding 
any overheads in terms of message complexity. 
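To make the idea of locally available samples more tangible, the following Python sketch shows a heavily simplified view exchange. It is only loosely modeled on gossip-based peer sampling and is not the protocol of [26]; the class name `PeerSamplingNode`, the rule of keeping the freshest entries, the view size, and the ring bootstrap are all assumptions of this example.

```python
import random

class PeerSamplingNode:
    """Toy node keeping a small partial view of (address -> timestamp) entries."""

    def __init__(self, address, view_size=5):
        self.address = address
        self.view_size = view_size
        self.view = {}                       # address -> freshest known timestamp

    def select_peer(self, rng):
        """Return a random address from the local view (no network access needed)."""
        return rng.choice(sorted(self.view))

    def exchange(self, other, now):
        """Swap views with `other`; both sides keep only their freshest entries."""
        mine = dict(self.view)
        mine[self.address] = now
        theirs = dict(other.view)
        theirs[other.address] = now
        for node, incoming in ((self, theirs), (other, mine)):
            merged = dict(node.view)
            for addr, ts in incoming.items():
                if addr != node.address:     # a node never stores its own address
                    merged[addr] = max(ts, merged.get(addr, ts))
            freshest = sorted(merged.items(), key=lambda e: e[1], reverse=True)
            node.view = dict(freshest[:node.view_size])

rng = random.Random(1)
nodes = [PeerSamplingNode(a) for a in range(30)]
for n in nodes:                              # bootstrap: each node knows its ring successor
    n.view[(n.address + 1) % len(nodes)] = 0
for cycle in range(1, 21):
    for n in nodes:
        peer = nodes[n.select_peer(rng)]
        n.exchange(peer, now=cycle)
```

In this sketch `select_peer` only reads local state, which mirrors the property stressed above: a sample is available at each node at any time without extra delay.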

d) P2P Learning: In the area of P2P computing, a large
number of fully distributed algorithms are known for calculating
global functions over fully distributed data, generally
referred to as aggregation algorithms. The literature of this
field is vast; we mention only two examples: Astrolabe [6]
and gossip-based averaging [7]. These algorithms are simple
and robust, but are capable of calculating only simple functions
such as the average. Nevertheless, these simple functions can
serve as key components for more sophisticated methods, such
as the EM algorithm [27], unsupervised learners [28] or the
collaborative filtering based recommender algorithms [29]–[32].
However, here we seek to provide a rather generic
approach that covers a wide range of machine learning models,
while maintaining a similar robustness and simplicity.

Algorithm 1 Gossip Learning Scheme
    initModel()
    loop
        wait(Δ)
        p ← selectPeer()
        send modelCache.freshest() to p
    end loop

    procedure ONRECEIVEMODEL(m)
        modelCache.add(createModel(m, lastModel))
        lastModel ← m
    end procedure

In the past few years there has been an increasing number 
of proposals for P2P machine learning algorithms as well, 
like those in [33]–[39]. The usual assumption in these studies
is that a peer has a subset of the training data on which a 
model can be learned locally. After learning the local models, 
algorithms either aggregate the models to allow each peer to 
perform local prediction, or they assume that prediction is 
performed in a distributed way. Clearly, distributed prediction 
is a lot more expensive than local prediction; however, model 
aggregation is not needed, and there is more flexibility in the 
case of changing data. In our approach we adopt the fully 
distributed model, where each node holds only one data record. 
In this case we cannot talk about local learning: every aspect 
of the learning algorithm is inherently distributed. Since we 
assume that data cannot be moved, the models need to visit 
data instead. In a setting like this, the main problem we need to 
solve is to efficiently aggregate the various models that evolve 
slowly in the system so as to speed up the convergence of 
prediction performance. 

To the best of our knowledge there is no other learning 
approach designed to work in our fully asynchronous and 
unreliable message passing model, and which is capable of 
producing a large array of state-of-the-art models. 

IV. Gossip Learning: the Basic Idea 

Algorithm 1 provides the skeleton of the gossip learning
framework. The same algorithm is run at each node in the
network. The algorithm consists of an active loop of periodic
activity, and a method to handle incoming models. Based on
every incoming model a new model is created, potentially
combining it with the previous incoming model. This newly
created model is stored in a cache of a fixed size. When the
cache is full, the model stored for the longest time is replaced
by the newly added model. The cache provides a pool of recent
models that can be used to implement, for example, voting
based prediction. We discuss this possibility in Section VI. In
the active loop the freshest model (the model added to the
cache most recently) is sent to a random peer.

Algorithm 2 CREATEMODEL: three implementations
    procedure CREATEMODELRW(m1, m2)
        return update(m1)
    end procedure

    procedure CREATEMODELMU(m1, m2)
        return update(merge(m1, m2))
    end procedure

    procedure CREATEMODELUM(m1, m2)
        return merge(update(m1), update(m2))
    end procedure

We make no assumptions about either the synchrony of the 
loops at the different nodes or the reliability of the messages. 
We do assume that the length of the period of the loop, $\Delta$,
is the same at all nodes. However, during the evaluations $\Delta$
was modeled as a normally distributed random variable with
mean $\mu = \Delta$ and spread $\Delta/10$. For simplicity, here we
assume that the active loop is initiated at the same time at all 
the nodes, and we do not consider any stopping criteria, so the 
loop runs indefinitely. The assumption about the synchronized 
start allows us to focus on the convergence properties of 
the algorithm, but it is not a crucial requirement in practical 
applications. In fact, randomly restarted loops actually help in 
following drifting concepts and changing data, which is the 
subject of our ongoing work. 

The algorithm contains abstract methods that can be implemented
in different ways to obtain a concrete learning
algorithm. The main placeholders are SELECTPEER and CREATEMODEL.
Method SELECTPEER is the interface for the peer
sampling service, as described in Section III. Here we use the
NEWSCAST algorithm [26], which is a gossip-based implementation
of peer sampling. We do not discuss NEWSCAST
here in detail; all we assume is that SELECTPEER() provides
a uniform random sample of the peers without creating any
extra messages in the network, given that NEWSCAST gossip
messages (that contain only a few dozen network addresses)
can piggyback on gossip learning messages.

The core of the approach is CREATEMODEL. Its task is
to create a new updated model based on locally available
information (the two models received most recently, and the
local single training data record) to be sent on to a random
peer. Algorithm 2 lists three implementations that are still
abstract. They represent those three possible ways of breaking
down the task that we will study in this paper.
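The skeleton and the three CREATEMODEL variants can be sketched in Python as follows. This is a toy, round-based simulation under stated assumptions: the names `GossipNode`, `make_variants` and `run_cycles` are hypothetical, peer selection is a plain uniform draw (which may even pick the sender itself), all nodes act in lockstep, and the "model" used in the demonstration is just an integer counting update steps.

```python
import random
from collections import deque

class GossipNode:
    """One node running the gossip learning skeleton (illustrative only)."""

    def __init__(self, record, create_model, init_model, cache_size=10):
        self.record = record                  # the single local training example
        self.create_model = create_model      # one of the three variants below
        self.last_model = init_model()
        # A bounded deque evicts the entry that has been stored the longest.
        self.cache = deque([self.last_model], maxlen=cache_size)

    def on_receive_model(self, m):
        self.cache.append(self.create_model(self, m, self.last_model))
        self.last_model = m

    def freshest(self):
        return self.cache[-1]

def make_variants(update, merge):
    """Build the RW, MU and UM variants from abstract update/merge functions."""
    rw = lambda node, m1, m2: update(m1, node.record)
    mu = lambda node, m1, m2: update(merge(m1, m2), node.record)
    um = lambda node, m1, m2: merge(update(m1, node.record), update(m2, node.record))
    return {"RW": rw, "MU": mu, "UM": um}

def run_cycles(nodes, cycles, rng):
    """Lockstep toy simulation: in each cycle every node sends exactly one model."""
    for _ in range(cycles):
        outgoing = [(rng.randrange(len(nodes)), n.freshest()) for n in nodes]
        for target, model in outgoing:
            nodes[target].on_receive_model(model)

# Demonstration with a dummy model: an integer that counts update steps.
variants = make_variants(update=lambda m, rec: m + 1, merge=max)
rng = random.Random(0)
nodes = [GossipNode(record=i, create_model=variants["MU"], init_model=lambda: 0)
         for i in range(8)]
run_cycles(nodes, cycles=5, rng=rng)
```

A real deployment would be asynchronous and would tolerate message loss; the lockstep loop is used here only to keep the example short.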

The abstract method UPDATE represents the online learning 
algorithm — the second main component of our framework 
besides peer sampling — that updates the model based on one 
example (the local example of the node). Procedure CREATEMODELRW
implements the case where models independently
perform random walks over the network. We will use this
algorithm as a baseline.

The remaining two variants apply a method called MERGE, 
either before the update (MU) or after it (UM). Method 
MERGE helps implement the third component: ensemble learn- 
ing. A completely impractical example for an implementation 
of MERGE is the case where the model space consists of all 
the sets of basic models of a certain type. Then MERGE can 
simply merge the two input sets, UPDATE can update all the 
models in the set, and prediction can be implemented via, 
for example, majority voting (for classification) or averaging 
the predictions (for regression). With this implementation, all
nodes would collect an exponentially increasing set of models,
allowing for a much better prediction after a much shorter
learning time in general than based on a single model [21],
[22], although the learning history for the members of the set
would not be completely independent.
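The impractical set-based implementation described above can be illustrated with a short Python sketch. The helper names are hypothetical, lists stand in for sets (duplicates are kept), and the growth loop assumes the idealized case in which two equally large messages are merged in every cycle.

```python
def merge_sets(models_a, models_b):
    """Impractical MERGE: keep every basic model from both inputs."""
    return models_a + models_b

def update_all(models, update_one):
    """UPDATE applied to each basic model in the collection."""
    return [update_one(m) for m in models]

def majority_vote(models, predict, x):
    """Classification by simple vote counting; predictions are in {-1, +1}."""
    votes = sum(predict(m, x) for m in models)
    return 1 if votes >= 0 else -1

# Number of basic models carried by one message, cycle after cycle.
sizes = [1]
for cycle in range(10):
    message = [0.0] * sizes[-1]
    sizes.append(len(merge_sets(message, message)))
```

Under this assumption the message size doubles in each cycle, which is the exponential growth that makes the approach unusable in practice.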

This implementation is of course impractical because the 
size of the messages in each cycle of the main loop would 
increase exponentially. Our main contribution is to discuss and 
analyze a special case: linear models. For linear models we 
will propose an algorithm where the message size can be kept 
constant, while producing the same (or similar) behavior as 
the impractical implementation above. The subtle difference 
between the MU and UM versions will also be discussed. 

Let us close this section with a brief analysis of the cost
of the algorithm in terms of computation and communication.
As for communication, each node in the network sends exactly
one message in each $\Delta$ time units. The size of the message
depends on the selected hypothesis space; normally it contains
the parameters of a single model. In addition, the message
also contains a small constant number of network addresses
as defined by the NEWSCAST protocol (typically around 20).
The computational cost is one or two update steps in each $\Delta$
time units for the MU or the UM variants, respectively (see
Algorithm 2). The exact cost of this step depends on the
selected online learner.

V. Merging Linear Models through Averaging 

The key observation we make is that in a linear hypothesis 
space, in certain cases voting-based prediction is equivalent 
to a single prediction by the average of the models that 
participate in the voting. Furthermore, updating a set of linear 
models and then averaging them is sometimes equivalent to 
averaging the models first, and then updating the resulting 
single model. These observations are valid in a strict sense 
only in special circumstances. However, our intuition is that 
even if this key observation holds only in a heuristic sense, it 
still provides a valid heuristic explanation of the behavior of 
the resulting averaging-based merging approach. 

In the following we first give an example of a case where 
there is a strict equivalence of averaging and voting to illustrate 
the concept, and subsequently we discuss and analyze a 
practical and competitive algorithm, where the correspondence 
of voting and averaging is only heuristic in nature. 



A. The Adaline Perceptron 

We consider here the Adaline perceptron [40] that arguably
has one of the simplest update rules due to its linear activation 
function. Without loss of generality, we ignore the bias term. 
The error function to be optimized is defined as 



\[
E_x(w) = \frac{1}{2}\,(y - \langle w, x \rangle)^2, \tag{3}
\]

where $w$ is the linear model, and $(x, y)$ is a training example
($x, w \in \mathbb{R}^d$, $y \in \{-1, 1\}$). The gradient at $w$ for $x$ is given
by

\[
\nabla_w E_x(w) = -(y - \langle w, x \rangle)\,x, \tag{4}
\]

that defines the learning rule for $(x, y)$ by

\[
w^{(k+1)} = w^{(k)} + \eta\,(y - \langle w^{(k)}, x \rangle)\,x, \tag{5}
\]

where $\eta$ is the learning rate. In this case it is a constant.

Now, let us assume that we are given a set of models
$w_1, \ldots, w_m$, and let us define $\bar{w} = (w_1 + \cdots + w_m)/m$. In
the case of a regression problem, the prediction for a given
point $x$ and model $w$ is $\langle w, x \rangle$. It is not hard to see that

\[
h(x) = \langle \bar{w}, x \rangle = \frac{1}{m} \Big\langle \sum_{i=1}^{m} w_i, x \Big\rangle
= \frac{1}{m} \sum_{i=1}^{m} \langle w_i, x \rangle, \tag{6}
\]

which means that the voting-based prediction is equivalent to
prediction based on the average model.

In the case of classification, the equivalence does not hold
for all voting mechanisms. But it is easy to verify that in
the case of a weighted voting approach, where vote weights
are given by $|\langle w_i, x \rangle|$, and the votes themselves are given by
$\mathrm{sgn}\langle w_i, x \rangle$, the same equivalence holds:

\[
h(x) = \mathrm{sgn}\Big( \frac{1}{m} \sum_{i=1}^{m} |\langle w_i, x \rangle| \, \mathrm{sgn}\langle w_i, x \rangle \Big)
= \mathrm{sgn}\Big( \frac{1}{m} \sum_{i=1}^{m} \langle w_i, x \rangle \Big)
= \mathrm{sgn}\langle \bar{w}, x \rangle. \tag{7}
\]

A similar approach to this weighted voting mechanism has
been shown to improve the performance of simple vote counting
[41]. Our preliminary experiments also support this.
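The two equivalences above are easy to check numerically. The following Python sketch does so on random data; the helper names and the random test models are assumptions of this example.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sgn(v):
    return (v > 0) - (v < 0)

def average_model(models):
    """Coordinate-wise average of a list of weight vectors."""
    m = len(models)
    return [sum(ws) / m for ws in zip(*models)]

def weighted_vote(models, x):
    """Votes sgn<w_i, x>, each weighted by |<w_i, x>|."""
    total = sum(abs(dot(w, x)) * sgn(dot(w, x)) for w in models)
    return sgn(total / len(models))

rng = random.Random(42)
models = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
x = [rng.uniform(-1, 1) for _ in range(4)]
w_bar = average_model(models)

# Regression: the mean of the individual predictions equals the
# prediction of the average model (up to rounding).
mean_prediction = sum(dot(w, x) for w in models) / len(models)

# Classification: weighted voting agrees with the sign of the average model.
vote = weighted_vote(models, x)
```

Note that simple vote counting (all weights equal to one) would not pass the second check in general; the equivalence relies on the weights $|\langle w_i, x \rangle|$.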

In a very similar manner, it can be shown that updating
$\bar{w}$ using an example $(x, y)$ is equivalent to updating all the
individual models $w_1, \ldots, w_m$ and then taking the average:

\[
\bar{w} + \eta\,(y - \langle \bar{w}, x \rangle)\,x
= \frac{1}{m} \sum_{i=1}^{m} \big[ w_i + \eta\,(y - \langle w_i, x \rangle)\,x \big]. \tag{8}
\]
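The identity in (8) follows from the linearity of the Adaline update in $w$, and it can be verified with a few lines of Python. The function names and the numeric values below are assumptions chosen for illustration.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def adaline_update(w, x, y, eta):
    """One Adaline step: w + eta * (y - <w, x>) * x."""
    err = y - dot(w, x)
    return [wi + eta * err * xi for wi, xi in zip(w, x)]

def average(models):
    m = len(models)
    return [sum(ws) / m for ws in zip(*models)]

models = [[0.5, -1.0], [2.0, 0.25], [-1.5, 1.0]]
x, y, eta = [1.0, 2.0], 1.0, 0.1

# Update every model, then average ...
update_then_average = average([adaline_update(w, x, y, eta) for w in models])
# ... versus average first, then update the single resulting model.
average_then_update = adaline_update(average(models), x, y, eta)
```

Both orders give the same weight vector up to floating-point rounding, for any choice of models, example and learning rate.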

The above properties lead to a rather important observation.
If we implement our gossip learning skeleton using Adaline,
as shown in Algorithm 3, then the resulting algorithm behaves
exactly as if all the models were simply stored and then
forwarded, resulting in an exponentially increasing number of
models contained in each message, as described in Section IV.
That is, averaging effectively reduces the exponential message
complexity to transmitting a single model in each cycle
independently of time, yet we enjoy the benefits of the aggressive
but impractical approach of simply replicating all the models
and using voting over them for prediction.

Algorithm 3 Pegasos and Adaline updates, initialization, and merging
    procedure UPDATEPEGASOS(m)
        m.t ← m.t + 1
        η ← 1/(λ · m.t)
        if y⟨m.w, x⟩ < 1 then
            m.w ← (1 − ηλ)m.w + ηyx
        else
            m.w ← (1 − ηλ)m.w
        end if
        return m
    end procedure

    procedure UPDATEADALINE(m)
        m.w ← m.w + η(y − ⟨m.w, x⟩)x
        return m
    end procedure

    procedure INITMODEL
        lastModel.t ← 0
        lastModel.w ← (0, ..., 0)^T
        modelCache.add(lastModel)
    end procedure

    procedure MERGE(m1, m2)
        m.t ← max(m1.t, m2.t)
        m.w ← (m1.w + m2.w)/2
        return m
    end procedure

It should be mentioned that, even though the number of
"virtual" models is growing exponentially fast, the algorithm
is not equivalent to bagging over an exponential number of
independent models. In each gossip cycle, there are only $N$
independent updates occurring in the system overall (where
$N$ is the number of nodes), and the effect of these updates
is being aggregated rather efficiently. In fact, as we will see
in Section VI, bagging over $N$ independent models actually
outperforms the gossip learning algorithms.

B. Pegasos 

Here we discuss the adaptation of Pegasos (a linear SVM
gradient method [18] used for classification) into our gossip
framework. The components required for the adaptation are
shown in Algorithm 3, where method UPDATEPEGASOS is
simply taken from [18]. For a complete implementation of
the framework, one also needs to select an implementation of
CREATEMODEL from Algorithm 2. In the following, the three
versions of a complete Pegasos-based implementation defined
by these options will be referred to as P2PEGASOSRW,
P2PEGASOSMU, and P2PEGASOSUM.
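A minimal Python rendering of the Pegasos update and the averaging-based merge may help in reading the pseudocode. It is a sketch under assumptions: the `Model` class, the function names, and the choice to pass the local example `(x, y)` and the regularization parameter `lam` as explicit arguments are made for this example, and the functions return new objects instead of modifying the model in place.

```python
from dataclasses import dataclass

@dataclass
class Model:
    w: list      # weight vector of the linear model
    t: int = 0   # number of update steps applied so far

def update_pegasos(m, x, y, lam):
    """One Pegasos step on the example (x, y) with step size 1 / (lam * t)."""
    t = m.t + 1
    eta = 1.0 / (lam * t)
    margin = y * sum(wi * xi for wi, xi in zip(m.w, x))
    w = [(1.0 - eta * lam) * wi for wi in m.w]        # shrink in both branches
    if margin < 1.0:                                   # hinge loss is active
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return Model(w, t)

def merge(m1, m2):
    """Average the weights and keep the larger step counter."""
    return Model([(a + b) / 2.0 for a, b in zip(m1.w, m2.w)], max(m1.t, m2.t))

m0 = Model([0.0, 0.0])
m1 = update_pegasos(m0, x=[1.0, 2.0], y=1, lam=0.5)   # margin 0 < 1: hinge active
m2 = update_pegasos(m1, x=[1.0, 2.0], y=1, lam=0.5)   # margin 10 >= 1: shrink only
merged = merge(m1, m2)
```

With these inputs the first step moves the zero model to (2, 4), the second step only shrinks it to (1, 2), and merging the two yields (1.5, 3) with step counter 2.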

The main difference between the Adaline perceptron and
Pegasos is the context dependent update rule that is different
for correctly and incorrectly classified examples. Due to this
difference, there is no strict equivalence between averaging
and voting, as in the case of the previous section. To see this,
consider two models, $w_1$ and $w_2$, and an example $(x, y)$, and
let $\bar{w} = (w_1 + w_2)/2$. In this case, updating $w_1$ and $w_2$ first,
and then averaging them results in the same model as updating
$\bar{w}$ if and only if both $w_1$ and $w_2$ classify $x$ in the same way
(correctly or incorrectly). This is because when updating $\bar{w}$, we
virtually update both $w_1$ and $w_2$ in the same way, irrespective
of how they classify $x$ individually.
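A small numeric example illustrates the argument. The Python sketch below applies one Pegasos-style step in both orders; the function names, the shared step counter `t`, and all numeric values are assumptions chosen so that the arithmetic is exact.

```python
def pegasos_step(w, t, x, y, lam):
    """Pegasos update of weight vector w at step t (t >= 1)."""
    eta = 1.0 / (lam * t)
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    new_w = [(1.0 - eta * lam) * wi for wi in w]
    if margin < 1.0:
        new_w = [wi + eta * y * xi for wi, xi in zip(new_w, x)]
    return new_w

def avg(a, b):
    return [(ai + bi) / 2.0 for ai, bi in zip(a, b)]

x, y, lam, t = [1.0, 0.0], 1, 0.5, 2

# Case 1: the two models fall into different branches of the update rule.
w1, w2 = [2.0, 0.0], [-2.0, 0.0]
um_disagree = avg(pegasos_step(w1, t, x, y, lam), pegasos_step(w2, t, x, y, lam))
mu_disagree = pegasos_step(avg(w1, w2), t, x, y, lam)

# Case 2: both models (and hence their average) fall into the same branch.
w3, w4 = [0.25, 0.0], [0.5, 0.0]
um_agree = avg(pegasos_step(w3, t, x, y, lam), pegasos_step(w4, t, x, y, lam))
mu_agree = pegasos_step(avg(w3, w4), t, x, y, lam)
```

In the first case the two orders give (0.5, 0) and (1, 0), so averaging and updating do not commute; in the second case both give (1.1875, 0).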

This seems to suggest that P2PEGASOSUM is a better
choice. We will test this hypothesis experimentally in Section VI,
where we will show that, surprisingly, it is not
always true. The reason could be that P2PEGASOSMU and
P2PEGASOSUM are in fact very similar when we consider
the entire history of the distributed computation, as opposed
to a single update step. The histories of the models define a
directed acyclic graph (DAG), where the nodes are merging
operations, and the edges correspond to the transfer of a model
from one node to another. In both cases, there is one update
corresponding to each edge: the only difference is whether
the update occurs on the source node of the edge or on
the target. Apart from this, the edges of the DAG are the
same for both methods. Hence we see that P2PEGASOSMU
has the favorable property that the updates that correspond
to the incoming edges of a merge operation are done using
independent samples, while for P2PEGASOSUM they are
performed with the same example. Thus, P2PEGASOSMU
guarantees a greater independence of the models.

In the following we present our theoretical results for both 
P2PEGASOSMU and P2PEGASOSUM. We note that these 
results do not assume any coordination or synchronization; 
they are based on a fully asynchronous communication model. 
First let us formally define the optimization problem at hand, 
and let us introduce some notation. 

Let $S = \{(x_i, y_i) : 1 \le i \le m,\ x_i \in \mathbb{R}^d,\ y_i \in \{+1, -1\}\}$ be
a distributed training set with one data point at each network
node. Let $f : \mathbb{R}^d \to \mathbb{R}$ be the objective function of the SVM
learning problem (applying the L1 loss in the more general
form proposed in Eq. (1)):

$$f(w) = \min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{m} \sum_{(x,y) \in S} \ell(w; (x, y)),$$

where $\ell(w; (x, y)) = \max\{0, 1 - y\langle w, x\rangle\}$.

Note that $f$ is strongly convex with parameter $\lambda$ [15]. Let
$w^*$ denote the global optimum of $f$. For a fixed data point
$(x_i, y_i)$ we define

$$f_i(w) = \frac{\lambda}{2}\|w\|^2 + \ell(w; (x_i, y_i)), \tag{10}$$

which is used to derive the update rule for the Pegasos
algorithm. Obviously, $f_i$ is $\lambda$-strongly convex as well, since it
has the same form as $f$ with $m = 1$.
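To make the relation between the global objective and the local objectives of Eq. (10) concrete, the following sketch evaluates both at a given $w$ and computes one subgradient of the local objective. It is only an illustration: the function names, the toy samples and the value of $\lambda$ are assumptions, not part of the paper.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def hinge_loss(w, x, y):
    """l(w; (x, y)) = max{0, 1 - y <w, x>}."""
    return max(0.0, 1.0 - y * dot(w, x))

def local_objective(w, x, y, lam):
    """f_i(w): regularizer plus the hinge loss on one local example."""
    return 0.5 * lam * dot(w, w) + hinge_loss(w, x, y)

def global_objective(w, samples, lam):
    """Regularizer plus the mean hinge loss over all samples, evaluated at w."""
    mean_loss = sum(hinge_loss(w, x, y) for x, y in samples) / len(samples)
    return 0.5 * lam * dot(w, w) + mean_loss

def local_subgradient(w, x, y, lam):
    """One subgradient of f_i at w; the hinge part vanishes if the margin holds."""
    g = [lam * wi for wi in w]
    if y * dot(w, x) < 1.0:
        g = [gi - y * xi for gi, xi in zip(g, x)]
    return g

# Hypothetical toy data: one example per "node".
samples = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
lam = 0.1
w = [0.5, -2.0]

locals_mean = sum(local_objective(w, x, y, lam) for x, y in samples) / len(samples)
print(global_objective(w, samples, lam), locals_mean)
print(local_subgradient([0.0, 0.0], samples[0][0], samples[0][1], lam))
```

The global objective at $w$ is the mean of the local objectives, which is why a stochastic step on a uniformly sampled $f_i$ is an unbiased step on $f$.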

The update history of a model can be represented as a binary
tree, where the nodes are models, and the edges are defined by
the direct ancestor relation. Let us denote the direct ancestors
of $w^{(i+1)}$ as $w_1^{(i)}$ and $w_2^{(i)}$. These ancestors are averaged and
then updated to obtain $w^{(i+1)}$ (assuming the MU variant).
Let the sequence $w^{(0)}, \ldots, w^{(t)}$ be defined as the path in this
history tree for which

$$w^{(i)} = \operatorname*{arg\,max}_{w \in \{w_1^{(i)},\, w_2^{(i)}\}} \|w - w^*\|, \qquad i = 0, \ldots, t. \tag{11}$$

This sequence is well defined. Let $(x_i, y_i)$ denote the training
example that was used in the update step that resulted in $w^{(i)}$
in the series defined above.

Theorem 1 (P2PEGASOSMU convergence): We assume
that (1) each node receives an incoming message after any
point in time within a finite time period (eventual update
assumption), (2) there is a subgradient $\nabla$ of the objective
function such that $\|\nabla_w\| \le G$ for every $w$. Then,

$$\frac{1}{t}\sum_{i=1}^{t} \left( f_i(\bar{w}^{(i)}) - f_i(w^*) \right) \le \frac{G^2(\log(t) + 1)}{2\lambda t}, \tag{12}$$

where $\bar{w}^{(i)} = (w_1^{(i)} + w_2^{(i)})/2$.

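Proof: During the running of the algorithm, let us pick
any node on which at least one subgradient update has been
performed already. There is such a node eventually, due to the
eventual update assumption. Let the model currently stored at
this node be $w^{(t+1)}$. We know that
$w^{(t+1)} = \bar{w}^{(t)} - \nabla^{(t)}/(\lambda t)$, where
$\bar{w}^{(t)} = (w_1^{(t)} + w_2^{(t)})/2$ and where $\nabla^{(t)}$ is the
subgradient of $f_t$. From the $\lambda$-convexity of $f_t$ it follows that

$$f_t(\bar{w}^{(t)}) - f_t(w^*) + \frac{\lambda}{2}\|\bar{w}^{(t)} - w^*\|^2 \le \langle \bar{w}^{(t)} - w^*, \nabla^{(t)} \rangle. \tag{13}$$

On the other hand, the following inequality is also true,
following from the definition of $w^{(t+1)}$, $G$ and some algebraic
rearrangements:

$$\langle \bar{w}^{(t)} - w^*, \nabla^{(t)} \rangle \le \frac{\lambda t}{2}\|\bar{w}^{(t)} - w^*\|^2 - \frac{\lambda t}{2}\|w^{(t+1)} - w^*\|^2 + \frac{G^2}{2\lambda t}. \tag{14}$$

Moreover, we can bound the distance of $\bar{w}^{(t)}$ from $w^*$ with
the distance of the ancestor of $\bar{w}^{(t)}$ that is further away
from $w^*$ with the help of the Cauchy-Bunyakovsky-Schwarz
inequality:

$$\|\bar{w}^{(t)} - w^*\|^2 \le \|w^{(t)} - w^*\|^2. \tag{15}$$

From (13), (14), (15) and the bound on the subgradients,
we derive

$$f_t(\bar{w}^{(t)}) - f_t(w^*) \le \frac{\lambda (t-1)}{2}\|w^{(t)} - w^*\|^2 - \frac{\lambda t}{2}\|w^{(t+1)} - w^*\|^2 + \frac{G^2}{2\lambda t}. \tag{16}$$

Note that this bound also holds for $w^{(i)}$, $1 \le i \le t$.
Summing up both sides of these $t$ inequalities, the distance
terms telescope and we get the following bound:

$$\sum_{i=1}^{t} \left( f_i(\bar{w}^{(i)}) - f_i(w^*) \right) \le -\frac{\lambda t}{2}\|w^{(t+1)} - w^*\|^2 + \frac{G^2}{2\lambda}\sum_{i=1}^{t}\frac{1}{i} \le \frac{G^2(\log(t) + 1)}{2\lambda}, \tag{17}$$

from which the theorem follows after division by $t$. ■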
The bound in (12) is analogous to the bound presented
in [15] in the analysis of the Pegasos algorithm. It basically
means that the average error tends to zero. To be able to
show that the limit of the process is the optimum of $f$,
it is necessary that the samples involved in the series are
uniform random samples [15]. Investigating the distribution
of the samples is left to future work, but we believe that
the distribution closely approximates uniformity for a large
$t$, given the uniform random peer sampling that is applied.

For P2PegasosUM, an almost identical derivation leads 
us to a similar result (omitted due to lack of space). 
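As a small numeric illustration of the bound in Theorem 1, the sketch below evaluates the right-hand side $G^2(\log(t)+1)/(2\lambda t)$ for growing $t$ and checks the harmonic-sum inequality used in the last step of the proof. The values of $G$ and $\lambda$ are hypothetical and chosen only for the demonstration.

```python
import math

def regret_bound(t, G, lam):
    """Right-hand side of the bound: G^2 (log(t) + 1) / (2 * lam * t)."""
    return G * G * (math.log(t) + 1.0) / (2.0 * lam * t)

def harmonic(t):
    """Sum of 1/i for i = 1..t, which appears when the t inequalities are summed."""
    return sum(1.0 / i for i in range(1, t + 1))

for t in (1, 10, 100, 1000, 10000):
    # The proof relies on harmonic(t) <= log(t) + 1.
    assert harmonic(t) <= math.log(t) + 1.0
    print(t, regret_bound(t, G=1.0, lam=0.01))
```

The printed values decrease as $t$ grows, which is the sense in which the average error tends to zero.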

VI. Experimental Results 

We experiment with two algorithms: P2PEGASOSUM and
P2PEGASOSMU. In addition, to shed light on the behavior of
these algorithms, we include a number of baseline methods
as well. To perform the experiments, we used the PeerSim
event-based P2P simulator.



A. Experimental Setup 

e) Baseline Algorithms: The first baseline we use is
P2PegasosRW. If there is no message drop or message 
delay, then this is equivalent to the Pegasos algorithm, since 
in cycle t all peers will have models that are the result of 
Pegasos learning on t random examples. In case of message 
delay and message drop failures, the number of samples will 
be less than t, as a function of the drop probability and the 
delay. 

We also examine two variants of weighted bagging. The
first variant (WB1) is defined as

$$h_{WB1}(x, t) = \operatorname{sgn}\left( \sum_{i=1}^{N} \langle x, w_i^{(t)} \rangle \right), \tag{18}$$

where $N$ is the number of nodes in the network, and the linear
models $w_i^{(t)}$ are learned with Pegasos over an independent
sample of size $t$ of the training data. This baseline algorithm
can be thought of as the ideal utilization of the $N$ independent
updates performed in parallel by the $N$ nodes in the network
in each cycle. The gossip framework introduces dependencies
among the models, so its performance can be expected to be
worse.

In addition, in the gossip framework a node has influence
from only $2^t$ models on average in cycle $t$. To account for this
handicap, we also use a second version of weighted bagging
(WB2):

$$h_{WB2}(x, t) = \operatorname{sgn}\left( \sum_{i=1}^{\min(2^t, N)} \langle x, w_i^{(t)} \rangle \right). \tag{19}$$

The weighted bagging variants described above are not
practical alternatives; these algorithms serve as a baseline only.
The reason is that an actual implementation would require $N$
independent models for prediction. This could be achieved by
P2PEGASOSRW with a distributed prediction, which would
impose a large cost and delay for every prediction. This could
also be achieved by all nodes running up to $O(N)$ instances
of P2PEGASOSRW, and using the $O(N)$ local models for
prediction; this is not feasible either. In sum, the point that we
want to make is that our gossip algorithm approximates WB2
quite well using only a single message per node in each cycle,
due to the technique of merging models.
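The two weighted bagging baselines can be sketched as follows, assuming that the $N$ independently trained models are already available in a list. The function names, the tie-breaking convention of `sgn` at zero and the toy models are assumptions made for illustration.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sgn(v):
    # Assumed convention: ties at zero are resolved to the positive class.
    return 1 if v >= 0 else -1

def wb1_predict(x, models):
    """WB1: sign of the summed inner products over all N models."""
    return sgn(sum(dot(x, w) for w in models))

def wb2_predict(x, models, t):
    """WB2: like WB1, but only min(2^t, N) models vote in cycle t."""
    k = min(2 ** t, len(models))
    return sgn(sum(dot(x, w) for w in models[:k]))

# Hypothetical models and example.
models = [[1.0, 0.0], [-3.0, 0.0], [1.0, 0.0]]
x = [1.0, 0.0]
print(wb1_predict(x, models))
print([wb2_predict(x, models, t) for t in range(3)])
```

Because the inner products are summed before taking the sign, a model with a large margin on $x$ carries more weight than one with a small margin, which is what makes the vote weighted.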

The last baseline algorithm we experiment with is PERFECT
MATCHING. In this algorithm we replace the peer sampling
component of the gossip framework: instead of all nodes
picking random neighbors in each cycle, we create a random
perfect matching among the peers so that every peer receives
exactly one message. Our hypothesis was that, since this
variant increases the efficiency of mixing, it will maintain
a higher diversity of models, and so a better performance can
be expected due to the "virtual bagging" effect we explained
previously. Note that this algorithm is not intended to be
practical either.
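The difference between random peer sampling and the perfect matching baseline can be illustrated with a short sketch. It assumes a global view of the peer list, which is only available in a simulator. The construction used here (one random cycle through a shuffled list) is one simple way to guarantee that every peer receives exactly one message; it is an assumption and not necessarily the construction used in the experiments.

```python
import random

def random_peer_targets(peers, rng):
    """Each peer independently picks a uniform random other peer (needs >= 2 peers)."""
    return {p: rng.choice([q for q in peers if q != p]) for p in peers}

def perfect_matching_targets(peers, rng):
    """Assign targets so that every peer sends one and receives exactly one message."""
    order = list(peers)
    rng.shuffle(order)
    # Each peer sends to its successor in the shuffled order (needs >= 2 peers).
    return {order[k]: order[(k + 1) % len(order)] for k in range(len(order))}

rng = random.Random(0)
peers = list(range(10))
matched = perfect_matching_targets(peers, rng)
sampled = random_peer_targets(peers, rng)
print(sorted(matched.values()) == peers)       # every peer is a target exactly once
print(len(set(sampled.values())), "distinct targets under random sampling")
```

Under independent random sampling some peers typically receive several messages while others receive none in a given cycle, which is the inefficiency the matching removes.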

f) Data Sets: We used three different data sets:
Reuters [43], Spambase, and the Malicious URLs [13] data
sets, which were obtained from the UCI database
repository [44]. These data sets are of different types including small
and large sets containing a small or large number of features.
Table I shows the main properties of these data sets, as well
as the prediction performance of the Pegasos algorithm.

The original Malicious URLs data set has a huge number of
features (about 3,000,000), therefore we first performed a feature
reduction step so that we can carry out simulations. Note that
the message size in our algorithm depends on the number of
features, therefore in a real application this step might also
be useful in such extreme cases. We applied the well-known
correlation coefficient method for each feature with the class
label, and kept the ten features with the maximal absolute
values. If necessary, this calculation can also be carried out in
a gossip-based fashion [7], but we performed it offline. The
effect of this dramatic reduction on the prediction performance
is shown in Table I, where Pegasos results on the full feature
set are shown in parentheses.
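The feature reduction step can be sketched as follows, assuming that the correlation coefficient method refers to the Pearson correlation between each feature column and the class label. This is an offline, centralized illustration with hypothetical function names and toy data, not the code used for the experiments.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def top_k_features(rows, labels, k):
    """Indices of the k features with the largest absolute correlation to the label."""
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(j for _, j in scores[:k])

# Hypothetical toy data: feature 0 tracks the label, features 1 and 2 do not.
rows = [[1.0, 5.0, 0.0], [2.0, 5.0, 1.0], [3.0, 5.0, 0.0], [4.0, 5.0, 1.0]]
labels = [-1, -1, 1, 1]
print(top_k_features(rows, labels, 1))
```

In the setting of the paper, `k` would be 10 and the rows would be the Malicious URLs training examples.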

g) Using the local models for prediction: An important
aspect of our protocol is that every node has at least one
model available locally, and thus all the nodes can perform
a prediction. Moreover, since the nodes can remember the
models that pass through them at no communication cost,
we cheaply implement a simple voting mechanism, where
nodes will use more than one model to make predictions.
Algorithm 4 shows the procedures used for prediction in the
original case, and in the case of voting. Here the vector x is the
unseen example to be classified. In the case of linear models,
the classification is simply the sign of the inner product with
the model, which essentially describes on which side of the
hyperplane the given point lies. In our experiments we used a
cache of size 10.

Algorithm 4 Local prediction procedures

procedure predict(x)
    w ← modelCache.freshest()
    return sign(⟨w, x⟩)
end procedure

procedure votedPredict(x)
    pRatio ← 0
    for m ∈ modelCache do
        if sign(⟨m.w, x⟩) ≥ 0 then
            pRatio ← pRatio + 1
        end if
    end for
    return sign(pRatio/modelCache.size() − 0.5)
end procedure
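The two prediction procedures can be sketched in a few lines, assuming a bounded FIFO cache that keeps the most recently seen models. The class name, the use of `collections.deque` and the convention that `sign(0)` is positive are assumptions made for this illustration.

```python
from collections import deque

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sign(v):
    # Assumed convention: zero is mapped to the positive class.
    return 1 if v >= 0 else -1

class ModelCache:
    """Keeps the most recent models seen at a node (default capacity 10)."""

    def __init__(self, capacity=10):
        self._models = deque(maxlen=capacity)  # oldest model is evicted first

    def add(self, w):
        self._models.append(list(w))

    def predict(self, x):
        """Predict with the freshest model only."""
        return sign(dot(self._models[-1], x))

    def voted_predict(self, x):
        """Majority vote of all cached models."""
        positive = sum(1 for w in self._models if dot(w, x) >= 0)
        return sign(positive / len(self._models) - 0.5)

cache = ModelCache(capacity=3)
for w in ([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]):
    cache.add(w)
x = [1.0, 0.0]
print(cache.predict(x), cache.voted_predict(x))
```

Since the cached models arrive anyway as part of the gossip protocol, the vote costs only local memory and computation.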

h) Evaluation metric: The evaluation metric we focus
on is prediction error. To measure prediction error, we need
to split the datasets into training sets and test sets. The
proportions of this splitting are shown in Table I. In our
experiments with P2PEGASOSMU and P2PEGASOSUM we
track the misclassification ratio over the test set of 100
randomly selected peers. The misclassification ratio of a model
is simply the number of the misclassified test examples divided
by the number of all test examples, which is the so-called 0-1
error.
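The 0-1 error and its average over a sample of peers can be computed as in the sketch below. It assumes linear models represented as weight lists and a test set of `(x, y)` pairs. The function names and the toy data are hypothetical.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1

def zero_one_error(w, test_set):
    """Fraction of test examples that the model w misclassifies."""
    wrong = sum(1 for x, y in test_set if predict(w, x) != y)
    return wrong / len(test_set)

def mean_error_over_peers(models, test_set, sample_size, rng):
    """Average 0-1 error of the models held by a random sample of peers."""
    sample = rng.sample(models, min(sample_size, len(models)))
    return sum(zero_one_error(w, test_set) for w in sample) / len(sample)

# Hypothetical toy test set and two peer models.
test_set = [([1.0, 0.0], 1), ([-1.0, 0.0], 1), ([0.0, 1.0], -1), ([2.0, 1.0], 1)]
models = [[1.0, 0.0], [-1.0, 0.0]]
print(zero_one_error(models[0], test_set))
print(mean_error_over_peers(models, test_set, 100, random.Random(1)))
```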

For the baseline algorithms we used all the available models,
whose number equals the number of training samples, for
calculating the error rate. From the Malicious URLs database we
used only 10,000 examples selected at random, to make the
evaluation computationally feasible. Note that we found that
increasing the number of examples beyond 10,000 does not
result in a noticeable difference in the observed behavior.

We also calculated the similarities between the models 
circulating in the network, using the cosine similarity measure. 
We calculated the similarity between all pairs of models, and 
calculated the average. This metric is useful to study the speed 
at which the actual models converge. Note that under uniform 
sampling it is known that all models converge to an optimal 
model. 
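The similarity metric can be sketched as the mean cosine similarity over all unordered pairs of models. The function names are illustrative, and returning 0.0 for a zero vector is an assumption made so that freshly initialized models do not cause a division by zero.

```python
import math
from itertools import combinations

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def cosine(a, b):
    """Cosine similarity; defined as 0.0 here if either vector is all zeros."""
    norm_a, norm_b = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)

def mean_pairwise_cosine(models):
    """Average cosine similarity over all unordered pairs (needs >= 2 models)."""
    pairs = list(combinations(models, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

print(mean_pairwise_cosine([[1.0, 0.0], [2.0, 0.0]]))
print(mean_pairwise_cosine([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

A value close to 1 indicates that the models in the network have become nearly parallel, that is, they have practically converged to the same hyperplane.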

TABLE I
The main properties of the data sets, and the prediction error (0-1 error) of the baseline sequential algorithm. In the case of
the Malicious URLs data set, the results on the full feature set are shown in parentheses.

                        Reuters        SpamBase       Malicious URLs (10)
Training set size       2,000          4,140          2,155,622
Test set size           600            461            240,508
Number of features      9,947          57             10
Class label ratio       1,300:1,300    1,813:2,788    792,145:1,603,985
Pegasos 20,000 iter.    0.025          0.111          0.080 (0.081)

i) Modeling failure: In a set of experiments we model
extreme message drop and message delay. Drop probability
is set to be 0.5. This can be considered an extremely large
drop rate. Message delay is modeled as a uniform random
delay from the interval $[\Delta, 10\Delta]$, where $\Delta$ is the gossip
period in Algorithm 1. This is also an extreme delay, orders
of magnitude higher than what can be expected in a realistic
scenario, except if $\Delta$ is very small. We also model realistic
churn based on probabilistic models in [45]. Accordingly, we
approximate online session length with a lognormal distribution,
and we approximate the parameters of the distribution
using a maximum likelihood estimate based on a trace from
a private BitTorrent community called FileList.org obtained
from Delft University of Technology [46]. We set the offline
session lengths so that at any moment in time 90% of the
peers are online. In addition, we assume that when a peer
comes back online, it retains the state that it had at the time
of leaving the network.
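The failure model described above can be sketched with the standard `random` module. The drop probability and the delay interval follow the text. The lognormal parameters `mu` and `sigma` are placeholders, since the fitted values are not given here, and the function names are hypothetical.

```python
import random

def message_dropped(rng, drop_probability=0.5):
    """True if the message is lost."""
    return rng.random() < drop_probability

def message_delay(rng, delta):
    """Uniform random delay from the interval [delta, 10 * delta]."""
    return rng.uniform(delta, 10.0 * delta)

def online_session_length(rng, mu, sigma):
    """Lognormal online session length; mu and sigma are placeholder parameters."""
    return rng.lognormvariate(mu, sigma)

rng = random.Random(42)
delta = 2.0  # hypothetical gossip period
delays = [message_delay(rng, delta) for _ in range(10000)]
drops = [message_dropped(rng) for _ in range(10000)]
print(sum(delays) / len(delays) / delta)   # mean delay in units of the gossip period
print(sum(drops) / len(drops))             # observed drop rate
print(online_session_length(rng, mu=1.0, sigma=0.5))
```

A simulator would apply `message_dropped` and `message_delay` to every message sent, and would use the session length sampler to schedule when each peer goes offline.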

B. Results and Discussion 

The experimental results for prediction without local voting
are shown in Figures 1 and 2. Note that all variants can
be mathematically proven to converge to the same result,
so the difference is in convergence speed only. Bagging can
temporarily outperform a single instance of Pegasos, but after
enough training samples, all models become almost identical,
so the advantage of voting disappears.

In Figure 1 we can see that our hypothesis about the
relationship of the performance of the gossip algorithms and
the baselines is validated: the standalone Pegasos algorithm is
the slowest, while the two variants of weighted bagging are the
fastest. P2PEGASOSMU approximates WB2 quite well, with
some delay, so we can use WB2 as a heuristic model of the
behavior of the algorithm. Note that the convergence is several
orders of magnitude faster than that of Pegasos (the plots have
a logarithmic scale).

Figure 1 also contains results from our extreme failure
scenario. We can observe that the difference in convergence speed
is mostly accounted for by the increased message delay. The
effect of the delay is that all messages wait 5 cycles on average
before being delivered, so the convergence is proportionally
slower. In addition, half of the messages get lost too, which
slows down convergence by another factor of about 2. Apart
from slowing down, the algorithms still converge to the correct
value despite the extremely unreliable environment, as was
expected.

Figure 2 illustrates the difference between the UM and
MU variants. Here we model no failures. In Section V-B we
pointed out that, although the UM version looks favorable
when considering a single node, when looking at the full
history of the learning process P2PEGASOSMU maintains
more independence between the models. Indeed, the MU
version clearly performs better according to our experiments.
We can also observe that the UM version shows a lower level
of model similarity in the system, which probably has to do
with the slower convergence.

In Figure 2 we can see the performance of the perfect
matching variant of P2PEGASOSMU as well. Contrary to our
expectations, perfect matching does not clearly improve
performance, apart from the first few cycles. It is also interesting
to observe that model similarity is correlated to prediction
performance also in this case. We also note that in the case
of the Adaline-based gossip learning implementation perfect
matching is clearly better than random peer sampling (not
shown). This means that this behavior is due to the
context-dependence of the update rule discussed in Section V-B.

The results with local voting are shown in Figure 3. The
main conclusion is that voting results in a significant
improvement when applied along with P2PEGASOSRW, the learning
algorithm that does not apply merging. When merging is
applied, the improvement is less dramatic. In the first few cycles,
voting can result in a slight degradation of performance. This
could be expected, since the models in the local caches are
trained on fewer samples on average than the freshest model
in the cache. Overall, since voting comes for free, it is advisable
to use it.

VII. Conclusions 

We proposed gossip learning as a generic approach to learn
models of fully distributed data in large scale P2P systems. The
basic idea of gossip learning is that many models perform a
random walk over the network, while being updated at every
node they visit, and while being combined (merged) with
other models they encounter. We presented an instantiation of
gossip learning based on the Pegasos algorithm. The algorithm
was shown to be extremely robust to message drop and
message delay. Furthermore, a very significant speedup was
demonstrated w.r.t. the baseline Pegasos algorithm due to the
model merging technique and the prediction algorithm that is
based on local voting.

The algorithm makes it possible to compute predictions
locally at every node in the network at any point in time,
yet the message complexity is acceptable: every node sends
one model in each gossip cycle. The main features that
differentiate this approach from related work are the focus
on fully distributed data and its modularity, generality, and
simplicity.

Fig. 1. Experimental results without failure (upper row) and with extreme failure (lower row). AF means all possible failures are modeled.

An important promise of the approach is the support for
privacy preservation, since data samples are not observed
directly. Although in this paper we did not focus on this aspect,
it is easy to see that the only feasible attack is the multiple
forgery attack [47], where the local sample is guessed based
on sending specially crafted models to nodes and observing
the result of the update step. This is very hard to do even
without any extra measures, given that models perform random
walks based on local decisions, and that merge operations are
performed as well. This short informal reasoning motivates
our ongoing work towards understanding and enhancing the
privacy-preserving properties of gossip learning.